The objective of this article is to illustrate that text mining and qualitative
research are epistemologically compatible. First, like many qualitative 
research approaches, such as grounded theory, text mining encourages 
open-mindedness and discourages preconceptions. Contrary to the 
popular belief that text mining is a linear and fully automated procedure, 
the text miner might add, delete, and revise the initial categories in an 
iterative fashion. Second, text mining is similar to content analysis, which 
also aims to extract common themes and threads by counting words. 
Although both of them utilize computer algorithms, text mining is 
characterized by its capability of processing natural languages. Last, the 
criteria of sound text mining adhere to those in qualitative research in 
terms of consistency and replicability.

Key Words: Text Mining, Content Analysis, Exploratory Data Analysis, Natural Language
Processing, Computational Linguistics, Grounded Theory, Reliability, and Validity


Problem and Purpose 

With advances in computing technology, text mining has become an emerging 
research method in various fields, including bioinformatics (Cohen & Hersh, 2005; Kano
et al., 2009; Kostoff, Block, Stump, & Pfeil, 2004; Kostoff, Morse, & Oncu, 2007;
Koussounadis, Redfern, & Jones, 2009; Vellay, Latimer, & Paillard, 2009; Winnenburg,
Wachter, Plake, Doms, & Schroeder, 2008; Yao, Evans, & Rzhetsky, 2009; Zaremba et
al., 2009), business (Consoli, 2009; Miller, 2005; Singh, Hu, & Roehl, 2007; Spangler et
al., 2009), engineering (Kostoff, Bedford, del Rio, Cortes, & Karypis, 2004; Kostoff &
DeMarco, 2001; Kostoff et al., 2006; Kostoff, Karpouzian, & Malpohl, 2005), and
education (Chen, Kinshuk, Wei, & Chen, 2008; Huang, Chen, Luo, Chen, & Chuang, 2008;
Lin, Hsieh, & Chuang, 2009). The preceding applications have a strong quantitative focus
in the sense that the outcome variables can be clearly defined; nonetheless, some
researchers have applied text mining to qualitative research projects and view text
mining as a viable qualitative research method (Camillo, Tosi, & Traldi, 2005; Hong,
2009; Janasik, Honkela, & Bruun, 2009).

The purpose of this article is to demonstrate that text mining and qualitative 
research are epistemologically compatible. First, like many qualitative research
approaches, including grounded theory, text mining encourages open-mindedness and 
discourages preconceptions (Vilkinas, 2008). Second, text mining is similar to content 
analysis, which is qualitative in essence (Lin et al., 2009). Last, the criteria of good text 
mining adhere to those in qualitative research in terms of reliability and validity 
(Krippendorff, 2004). 

What is Qualitative?

One may argue that text mining and qualitative methods are vastly different in 
nature because the former, which employs algorithms for counting words, is inherently a 
quantitative method. In response to this assertion, Krippendorff (2004) argued that text 
analysis is indeed qualitative. In his view, reading texts and counting words, regardless 
of whether it is performed by a human or a computer, does not remove the qualitative 
nature of the texts. As a matter of fact, today many qualitative researchers employ 
computer software modules as an aid. 

According to Janasik et al. (2009), seemingly qualitative methods of gathering
data, such as observation, participation, document analysis, and interviews, do not
necessarily make a study qualitative. The qualitative attribute of a study resides not in
the data collection method, but in the data type and in the method with which the data are 
analyzed. In their view, in a qualitative study the data should not be converted to 
numeric values, and mathematical and statistical tools should not be used in the analysis. 
Rather, the data are processed through systematization, categorization, and interpretation. 
The first part of the definition (data type as qualitative) is the same as that suggested by 
Krippendorff (2004), but the second part (the absence of mathematical and statistical 
tools) is debatable. It is doubtful whether this type of “purity” in methods is an essential 
feature of qualitative research. 

Consider the metaphor of photography. Some film-based photographers 
complained that digital photographers distort the authenticity of the captured images by 
digital manipulation, and thus digital photography is computer graphics rather than true 
photography. However, they overlook the fact that adding filters on the lenses and 
darkroom manipulation, such as burning and dodging, are also considered manipulation. 
There is no “purity” in any photographic process. By the same token, purity in the 
analytical process cannot be a criterion for demarcating quantitative and qualitative 
approaches. For example, when a quantitative researcher employs exploratory data 
analysis (EDA) and data visualization (DV) to detect a pattern, there is no “cut-off” value
or numeric standard to determine what constitutes a pattern. In this case, he or she must 
make a qualitative-based decision. It would be absurd to exclude EDA and DV from the 
realm of quantitative methodology just because qualitative elements are involved in the 
analytical process. Therefore, it is the conviction of the authors that the qualitative 
attribute of a research study should be associated with the data type. Although text 
mining involves counting words and appears to be a quantitative method, its data type is 
still qualitative. In essence, text mining shares common ground with
other qualitative methods, such as grounded theory, which will be discussed next.

Text Mining and Grounded Theory
Openness to Surprising Results 

Text mining is typically defined as a process of extracting useful information 
from document collections through the identification and exploration of interesting 
patterns (Feldman & Sanger, 2007). Similarly, grounded theory was developed to 
explore the data with an open mind. Grounded theorists intend to identify categories,
concepts, and constructs that explain a process, an action, or interaction about a 
substantive topic (Glaser, 1978, 1992; Glaser & Strauss, 1967). In alignment with 
grounded theory, in which preconceptions must be put aside, text mining requires
open-mindedness of the miners in order to let the categories emerge from the data. Classical
grounded theorists assert that a theory must be grounded on the data. Following this 
logic, Glaser asserted, “There is a need not to review any of the literature in the 
substantive area under study” (p. 31). However, it is impossible for any researcher, no
matter how open the researcher is, to maintain “purity” or to be “uncontaminated” by any 
preconceptions. So-called “forcing,” which results from certain unconscious 
preconceptions, can occur when the researcher imposes certain tacit structures on the 
phenomenon under study and then the researcher fits the data into the existing 
interpretive framework (Janasik et al., 2009). 

As a remedy, text miners initially conduct the coding using automated algorithms,
which restrains the researcher from making premature decisions in the research
process. This does not necessarily mean that text mining is truly “open-minded” or is
superior to grounded theory. Development of the text mining algorithms necessitates 
some preconceptions about the proper classification method, but the user of the text 
mining software module is blind to these preconceptions of the programmer. 

Coding as an Iterative Process 

More importantly, both grounded theory and text mining utilize an iterative 
process. In the former, initial categories extracted from the data must be constantly
compared against new data (Glaser & Strauss, 1967), and thus the researcher is open to 
the possibility that previous categories might be collapsed and revised, and new 
categories might be added. By the same token, a text mining algorithm is designed to 
learn from the data by revising the categories. However, if this learning and revision is
performed by automated algorithms alone, how could it be related to openness of the 
researchers? It is important to point out that text mining is not characterized by complete 
automation. Kostoff et al. (2007) explicitly stated that text mining is not a substitute for 
the judgment of the researchers, and should serve as a supplement. Human judgment, 
especially qualified-based judgment, must be made at different points of the process. The 
text miner might re-classify some entries into different categories or delete some 
redundant categories when the software makes an obvious mistake. In addition, Hong
(2009) asserted that, in order to unearth hidden insight, rare but meaningful data must be
scrutinized for pattern recognition. The text miner must make a qualitative judgment to 
set the key words and decide on the constraints. Based upon the input from the human 
miner, the computer system extracts small but meaningful data and then passes the output 
to pattern recognition. Therefore, the miner must interact with the pattern generated from 
the system, and also set more rigorous constraints and keywords to narrow down the 
search space. This iterative process continues until an optimal solution is obtained. In 
this sense, the openness in text mining is just like that of grounded theory in terms of
employing an iterative process to look for new and unexpected categories. 
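
The loop described above can be sketched in code. The following is only a minimal illustration under stated assumptions, not the procedure of Hong (2009) or of any particular text mining package: the function names `search` and `refine`, the sample sentences, and the keywords are all hypothetical, and each keyword round stands in for a decision that a human miner would make after reading the output of the previous round.

```python
def search(documents, keywords):
    """Return the documents that contain every keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [d for d in documents if all(k in d.lower() for k in lowered)]


def refine(documents, keyword_rounds):
    """Add keywords round by round and record the matches after each round."""
    history = []
    keywords = []
    for added in keyword_rounds:
        # Each round tightens the constraints set in the previous rounds.
        keywords = keywords + list(added)
        history.append((tuple(keywords), search(documents, keywords)))
    return history


# Invented sentences, used only to make the sketch runnable.
documents = [
    "The kitchen floor was wet and I slipped during the dinner rush.",
    "My back hurts after lifting heavy pots every day.",
    "I burned my hand on the wok but kept working.",
    "The manager said the wet floor was my own fault.",
]

# Hypothetical rounds: the miner starts broad, inspects the hits, then narrows.
history = refine(documents, [["floor"], ["slipped"]])

for keywords, matches in history:
    print(keywords, len(matches))
```

In this sketch the computer does the matching, but the decision about which keyword to add next, and when to stop, remains with the researcher.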

Text Mining and Content Analysis

As mentioned in the previous section, text mining aims at pattern recognition and 
does not test pre-formulated hypotheses or assume the existence of pre-established
taxonomies. In this sense, text mining should be in alignment with exploratory data
analysis (Hearst, 1999). In terms of the exploratory character, text mining is closely
related to content analysis (CA), which is a method of gathering, analyzing, and 
categorizing the content associated with psychological constructs without preconceptions. 
The data-driven categories are called inferred categories, which means that they are
inferred or emerge from the data (Vilkinas, 2008). 

Strategies and Examples of Content Analysis 

Different researchers might implement CA differently. Based on the framework 
of qualitative analytical procedures developed by Miles and Huberman (1994), 
Romanowski (2009) outlined the common strategies of qualitative content analysis as 
follows: (a) The researcher carefully examines the textual data and takes notes; (b) The
researcher performs data reduction by selecting, focusing, and condensing the data in the 
way that could best answer the research questions; (c) The researcher organizes, arranges 
and displays the condensed data. Based on the display, the researcher identifies themes, 
patterns, connections, and omissions that could help answer the research questions. 
Further, quotations might be listed for supporting the themes and inter-connections 
among the themes. If necessary, categories could be added, deleted, and revised to 
maximize mutual exclusivity and exhaustiveness; and (d) The researcher revisits the data 
many times in order to verify, test, or confirm the themes and patterns identified.

While Romanowski (2009) outlined the basic principle, a study conducted by Tsai 
(2009) illustrates how CA is implemented to investigate the experiences of occupational 
injury and illness among Chinese immigrant restaurant workers in the US. After 
conducting twenty-one interviews and three focus groups, the researcher read each 
transcript word by word and highlighted the texts that appeared to capture the injury or 
illness experiences. Additionally, notes were taken as reflections about the raw data. 
After all of the highlighted texts with codes were entered into a qualitative data 
management software module, the coded data were re-organized and displayed in print 
for within- and between-interviews comparisons. While examining the retrieved data 
segments, the researcher revised some codes or quotations. After going through an 
iterative process of recoding, the researcher integrated the final set of codes into 
meaningful categories based on how the codes were inter-connected. It was found that 
occupational illnesses among Chinese workers are closely tied to certain cultural 
concepts, such as the belief that illness is a consequence of broken harmony and balance.

History of Content Analysis 

Historically, CA can be seen as a precursor to today’s text mining. While initially 
used in classifying religious hymns in 18th century Sweden (Smith, 2000), CA has its 
social science origins in both political science and psychology. In political science, CA 
was heavily used in propaganda analysis (Lasswell et al., 1949), whereas in psychology it
was associated with data analysis for personality tests (Russell & Stiles, 1979; Smith, 2000).
While psychologists do not refer to this process as CA, the emphasis in examining the 
verbal text from these tests relies on analysis of the text to identify common themes, 
associations, and imagery, whose importance is ranked by their frequency in
context.

Early development of personality psychology heavily relied on text analysis. To 
identify personal dispositions that are unique to individuals, Allport, the father of modern 
personality psychology, and his colleague Odbert (1936) counted 17,953 descriptive 
words in Webster’s New International Dictionary in order to extract descriptions of 
personality characteristics (Feist & Feist, 2006). In addition, psychology has historically 
used CA to explore small group communication (Bales, 1950) and dream interpretation 
(Hall & Van de Castle, 1966). As early as 1966, computers were also used to assist in 
CA. Stone, Dunphy, Smith, and Ogilvie (1966) created a computer program to provide 
basic statistics on word usage and categories for the words used, allowing for a basic, 
computer-based analysis of the inputted text. 

More recently, as computing power has increased, CA has been used to examine a
variety of texts and settings: to determine the authorship of unidentified written works
(Smith, 2000), to distinguish between stories of women who were and were not victims
of sexual abuse (Arkhurst, 1994), to investigate the psychological status of psychiatric 
patients (Oxman, Rosenberg, Schnurr, & Tucker, 1985), and to create a personality 
portrait of President Nixon based on his inauguration speech (Winter & Carlson, 1988). 
Based on a complex system of text analysis initially developed for use in analyzing 
results from psychological personality tests, researchers examining Nixon’s inaugural 
speech applied CA to the language used to create a personality profile of his achievement, 
affiliation-intimacy, and power motives. This profile was validated by six aides who
worked closely with Nixon and was used to explain the paradoxes in his behavior (such as
comments made early in his life regarding honesty compared to his involvement in the
Watergate scandal). In the last example, CA was used as a tool to write
“psycho-history.”

Interestingly, although CA and text mining possess many commonalities, 
researchers in the two fields rarely make references to each other. If a literature review is 
conducted using both keywords, one can find only a few articles that link CA to text 
mining (e.g., Lee & Hu, 2004; Lin et al., 2009). Lin et al. found that although CA is a 
popular method to study student discussion in course management systems, it is too 
labor-intensive for instructors. To save time and resources, they developed a text mining 
system to facilitate the automatic coding process. In their view, text mining is a suitable
replacement for CA.

Role of Natural Language Processing in Text Mining 

Content analysis is not necessarily totally manual and labor-intensive. Before the
emergence of text mining, statistical methods, aided by the availability of
increasingly powerful computers, had been applied to content analysis. This trend resulted in a
form of text analysis that provides novel information about a passage, allowing more 
advanced theories and hypotheses to be drawn from the text (Manning & Schütze, 1999).
Well-known examples of software modules for text-based CA are WordStat (Provalis
Research, 2006), HyperRESEARCH (Researchware, 2008), MAXQDA (VERBI Software,
2007), and NVivo (QSR International, 2007). Although both CA and text mining utilize
computer algorithms, there is one major difference between the software packages for 
CA and those for text mining. At the present time, most CA-oriented software packages, 
as cited above, use statistics-based algorithms for counting words. On the other hand, 
natural language processing (NLP) plays an important role in text mining. While strict 
CA provides descriptive information, text mining using NLP can uncover patterns and
provide predictive information, based on a more sophisticated understanding of language. 

Natural language processing is a subfield of artificial intelligence (AI) and
computational linguistics (CL) that focuses on the automatic analysis of human language
using algorithms that can handle “fuzzy” structures (Gelbukh, 2007; Jurafsky &
Martin, 2000; Kao & Poteet, 2007; Mehler & Köhler, 2007). Based on AI and CL
theories, NLP aids text mining in information retrieval (Singhal, 2001) and automatic 
summarization (Mani, 2001). Natural language processing aims to address the complexity 
and multiple connotations of natural languages. In varying contexts, a single word can 
mean different things. For example, “books” in the phrase “he books tickets” is different 
from the same word in “he reads books.” Relying on a computer to conduct text analysis 
could be dangerous if the software is not well-written. As a remedy, text mining employs 
NLP in an attempt to “understand” the data as though a human coder read the text. The 
NLP movement is inspired by Chomsky’s (1957) notion that there are universal syntactic 
structures that are common to all languages. Based on this notion, linguists and computer 
scientists believed that rule-based algorithms could be developed to process languages.
As a result, research in linguistics and the philosophy of language set the agenda for 
explorations in NLP. Besides the school of universal syntactic structures, NLP 
researchers also explore the viability of data-driven NLP, which is example-based rather
than rule-based (Dale & Moisl, 2000). 
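
The contrast between counting words and processing language can be illustrated with the “books” example above. The sketch below is a toy written for that sentence pair only; it does not represent how any NLP system or any of the cited software packages works. The helper names (`tokenize`, `count_words`, `tag_books`) and the single hand-written rule are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed list for the toy rule; a real system would use a trained tagger.
SUBJECT_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(text):
    """Frequency counting: every occurrence of a word form is treated alike."""
    return Counter(tokenize(text))


def tag_books(text):
    """Toy rule: 'books' right after a subject pronoun is a verb, else a noun."""
    tags = []
    for sentence in re.split(r"[.!?]", text):
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == "books":
                previous = tokens[i - 1] if i > 0 else ""
                tags.append("verb" if previous in SUBJECT_PRONOUNS else "noun")
    return tags


text = "He books tickets. He reads books."
counts = count_words(text)
tags = tag_books(text)

print(counts["books"])  # the counter sees one word form used twice
print(tags)             # the rule separates the two uses by context
```

A pure frequency count merges the two uses of “books” into a single entry, whereas even this crude context rule keeps them apart. Rule-based and data-driven NLP pursue the same goal with far richer rules or with examples learned from data.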

Data Sources of Content Analysis and Text Mining 

There is another major difference between CA and text mining: data sources. 
Historically, most CA studies have been concerned with sociological or psychological 
constructs while recent text mining applications span across many fields. However, this focus is
by no means an inherent characteristic of CA. Not only are the techniques of CA not
restricted to analyzing social sciences data, but also the data sources of CA are broader 
than those of text mining. Content analysis can be conducted on written text, transcribed 
speech, verbal interactions, visual images, nonverbal behaviors, sound events, or any 
other message type. For example, one of the seminal works of CA is film analysis during 
the 1920s and 1930s. At that time some Americans were concerned with obscene movie 
content and its effects on young people. As the research center of the film content 
analysis, Ohio State University sent coders to take notes in theaters while watching 
movies for later classification of the sex and crime scenes of the movies (Neuendorf, 
2002). Another masterpiece of CA was accomplished by British Intelligence during 
World War II. A group connected with the BBC systematically analyzed propaganda 
from radio broadcasts aired by the Axis powers. Based on this analysis, the Allied
Forces were able to forecast the deployment of German troops and the launch of V2
rockets (Krippendorff, 2004; Neuendorf, 2002). All these “non-text” data sources must be
counted and analyzed by human coders. Even today no automated text mining software 
module is smart enough to “watch” a movie or “listen” to a radio broadcast. 

In summary, the history of CA and the examples of CA studies cited above 
indicate that text mining, by objective and essence, is similar to CA. Traditionally, it is 
perceived that CA is based on human coding while text mining utilizes computerized coding.
Today this demarcation is blurred. In the flowchart of content analysis illustrated by 
Neuendorf (2002), the researcher could choose either the human coding approach or the 
computer coding approach. However, in most cases the researcher might employ both. 
As mentioned before, a good text miner does not completely hand over the judgment to 
the automated computer system; rather, he or she might override the computer-coded 
results by adding, deleting, collapsing, and renaming certain categories. In this sense, a 
text miner is a content analyzer, and vice versa. 

Reliability and Validity in Text Mining and a Qualitative Approach 
Controversy of Reliability and Validity in Qualitative Research 

Krippendorff (2004) asserted that the most crucial form of reliability in text 
analysis is replicability, which means that a convergent result can be yielded from 
different coders at different points of time and under different circumstances. For him, 
reliability is a means rather than an end, because the purpose of obtaining reliable data is 
to make valid inferences. Simply stated, there is no point in counting unless the 
frequencies could lead to inferences regarding the subject matter. In short, Krippendorff 
contended that “validating evidence ... is the ultimate justification of content analysis” (p.
30). Text miners also view reliability as a central issue of text analysis. For example, 
SPSS Inc. (2006), publisher of Text Analysis for Surveys, highlighted the benefit of 
computer-aided text analysis by saying “reliability of results increases dramatically, since 
extraction and categorization are always performed in a consistent and repeatable 
manner” (p. 3). Although the preceding assertions are well-intended, they might be
disputable in the context of the qualitative paradigm. By definition “inference” is an act
of expanding the conclusion from a smaller subset to a broader set (e.g., from the sample 
statistics to the population parameter), but most qualitative studies do not aim to make 
“valid inferences.” While the meanings of reliability and validity are standardized in 
quantitative research (e.g., internal consistency, temporal stability, form equivalence, 
inter-rater reliability, content validity, criterion validity, and construct validity), the usages
of reliability and validity in qualitative research are diverse and controversial. As a result,
one might wonder what type of reliability could be used as the criterion for assessing a 
text mining study. It is the conviction of the authors that text mining, which emphasizes 
reliability in the form of consistency and replicability, is highly compatible with the 
qualitative paradigm. 

Originally, the concepts of “reliability” and “validity” were introduced by 
quantitative methodologists in an attempt to preserve the scientific merits of research 
studies. While certain qualitative researchers accept these criteria (Morse, 1999), some 
hold skeptical attitudes toward these concepts (Altheide & Johnson, 1998; Guba & 
Lincoln, 1985, 1989; Lincoln & Guba, 1985). A qualitative project is typically regarded
as a contextualized study, and thus generalizability and reliability in terms of replicability
are not accepted as standards of rigorous research by some qualitative researchers. In 
addition, many qualitative researchers question the use of reliability and validity in 
qualitative research on the ground that these are “positivist” or “logical-positivist” 
concepts, which are based upon the recognition of an “objective reality” and the goal of 
seeking causal relationships (Golafshani, 2003; Guba & Lincoln, 1989), and that a 
quantitative approach would “fragment and delimit phenomena” (Golafshani, p. 598). 
An argument that is commonly used to question the conventional sense of reliability is 
that absolute objectivity, which is based upon the premise of an objective reality, is 
delusional (Niemann et al., 2000). Under careful scrutiny, one can see that negating a
notion by saying that the ideal state (absolute objectivity) can never be achieved is not a 
good strategy at all. Simply put, we cannot absolutely cure all diseases, but it does not 
mean that medical researchers who devote efforts to finding better cures are delusional, 
or that it is better to leave germs and bacteria unchecked, for they continue to exist 
anyway. 

Positivism as a Straw-Man 

Further, the anti-positivist argument is nothing more than attacking a straw-man, 
because positivists did not subscribe to the preceding views. For example, although 
positivist Schlick (1959) stated that reality refers to experience, Schlick (1925/1974) did 
not maintain that there is a direct path from sense experience to genuine knowledge 
because immediate contact with the given is both fleeting and subjective. Contrary to 
popular belief, some logical positivists are anti-realists. Even those logical positivists 
who accept a realist position do not regard the aim of science as finding the objective 
truth corresponding to the objective reality. Instead, they view inquiry as a convention
for convenience. The best-known brand of conventionalism is Carnap’s linguistic
conventionalism (Carnap, 1937). In addition, the meaning of causation has been 
approached by different schools of thought. One of these approaches believes that 
causation involves a producing or forcing phenomenon (If X is a cause of Y, a change of 
X produces or forces a change in Y; Blalock, 1964). However, this view is incompatible 
with logical positivism’s perspective that “cause,” as an invisible force or a theoretical 
entity, cannot be observed or measured. In brief, according to “verificationism” proposed
by logical positivists, statements that cannot be verified have no content. In this view,
causal statements are non-verifiable statements (Schuldenfrei, 1972; Yu, 2006). In short, 
questioning the value of reliability and validity for their alleged association with 
positivism is problematic. 

Alternate Terms do not Introduce New Information 

While Lincoln and Guba (1985) asserted that the conventional benchmarks based 
on reliability and validity do not fit into the assumption of multiple constructed realities 
in qualitative research, different alternatives have been proposed, such as trustworthiness, 
rigor, quality (Golafshani, 2003), credibility, neutrality, confirmability, dependability, 
applicability, transferability (Lincoln & Guba, 1985), complexity, and consensus (Hall & 
Stevens, 1991). However, Long and Johnson (2000) found that there is nothing to be
gained from the use of alternative terms. Actually, they are often shown to be the same
as the traditional terms of reliability and validity. For example, Guba and Lincoln (1989) 
defined “dependability” as the stability of data over time (p. 242). Indeed, the very 
essence of dependability is the same as that of reliability: “to ensure that data collection is 
undertaken in a consistent manner free from undue variation which unknowingly exerts 
an effect on the nature of the data” (Long & Johnson, p. 31). In short, it is not a novel 
conceptualization at all. Rather than putting aside the issues of reliability and validity or 
renaming them, qualitative researchers should take them into account in terms of their 
original meanings.

Text Mining Improves Consistency and Replicability 

A high degree of subjectivity in coding open-ended responses has drawn 
researchers’ attention to the issue of inter-rater reliability in qualitative research 
(Armstrong, Gosling, Weinman, & Marteau, 1997; Moret, Reuzel, van der Wilt, & Grin,
2007; Thompson, McCaughan, Cullum, Sheldon, & Raynor, 2004). Some critics 
expressed concerns that qualitative data analysis fails to provide replicable and 
generalizable conclusions (Carey, Morgan, & Oxtoby, 1996). Moret et al. are concerned 
with whether qualitative researchers involved in the same project can converge into the 
same interpretive framework. As a remedy, they conducted an inter-rater agreement 
analysis in the fashion of estimating reliability. However, some proponents assert that 
inter-rater reliability is applicable to semi-structured data only, in which all respondents 
answer the same question in the same format, but interpreting unstructured responses to 
interactive interviews should be conducted by the interviewers who know the messages 
the subjects intended to convey through their responses. Conversely, study results based 
upon member checks of coding would be decontextualized and abstracted from individual 
participants (Morse, 1997; Morse, Barratt, Mayan, Olson, & Spiers, 2002). 
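
To make the idea of an inter-rater agreement analysis concrete, the sketch below computes Cohen's kappa, one widely used chance-corrected agreement index for two coders. The source does not state which statistic Moret et al. used, so kappa is an assumption chosen for illustration, and the code labels and the two coders' assignments are invented.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must label the same non-empty item list.")
    n = len(codes_a)
    # Observed agreement: share of items given the same code by both coders.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected agreement by chance, from each coder's marginal frequencies.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        # Both coders used one identical code throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented codes assigned by two hypothetical coders to six text segments.
coder_a = ["injury", "injury", "illness", "illness", "other", "injury"]
coder_b = ["injury", "illness", "illness", "illness", "other", "injury"]

kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

Here the coders agree on five of six segments, and kappa discounts the share of that agreement that would be expected by chance alone.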

It is important to point out that Morse et al. (2002) did not intend to reject the
concepts of reliability and validity altogether. On the contrary, they see value in 
qualitative research but question the indispensability of inter-rater reliability. At the 
beginning of their article, Morse et al. explicitly state, “Without rigor, research is
worthless, becomes fiction, and loses its utility” (p. 11). On another occasion, Morse 
(1999) asserted, “Rigorous research must be reliable and valid” (p. 717). Understandably, 
different coders might interpret the data differently whereas some coders are more 
familiar with the participants and the content. However, aside from inter-rater reliability 
and replicability, reliability can be characterized by consistency and test-retest reliability. 
Assuming that the person most familiar with the data performs the coding, is it 
reasonable to expect that a consistent scheme is evenly applied to all data by the same 
coder? In addition, if the same person goes back to the data set one more time, it is 
expected that similar classified results would be generated, unless the coder
re-conceptualizes the research question or gains new insight after the first round of coding.
The aforementioned scenarios have a quantitative equivalence in internal consistency and 
test-retest reliability. They are not about replicability between coders or generalizability 
in other contexts; rather, they refer to the quality of data interpretation within the same 
coder. However, a human coder is subject to many uncontrollable factors, such as fatigue, 
boredom, varying emotional states, and carelessness. Undoubtedly, text mining 
algorithms can produce more consistent and verifiable results than a human coder. If
Morse (1999) and Morse et al. (2002) accept the notion of reliability and validity as using 
rigorous standards to verify research results, then a high degree of compatibility between 
text mining and rigorous qualitative research truly exists. 

Conclusion 

In summary, text mining, as a data-driven research tool that allows categories to 
emerge from the data, shares the same goal with certain qualitative methods, such as 
grounded theory and content analysis. Both content analysis and text mining employ 
computer algorithms for counting words, but text mining goes further by interpreting the 
contexts of the words using natural language processing. However, this does not necessarily 
imply that text mining is superior to content analysis. Text mining is confined to textual 
analysis, whereas the scope of content analysis extends to audio and video. Last, if a 
well-written algorithm is used and the researcher is sufficiently trained to exercise 
discernment about the categories, text mining could maintain a high degree of consistency, 
and this aspect of reliability is fully compatible with the criteria of sound qualitative 
research. 

In 2006, a conference entitled “Bridging quantitative and qualitative methods for 
social sciences using text mining techniques” was held by the National Centre for Text 
Mining in England, and many promising ideas were proposed (Ananiadou, 2006; Frantzi, 
2006; Gibbs, 2006; Gillam, 2006; Lewins, 2006; Nasukawa, 2006; Wilson, 2006). 
However, most contributors to building this “bridge” are European, and these “building 
blocks” are much less visible in the American methodological community. At the time of 
this writing, when the keywords “text mining and qualitative” are used for searching 
scholarly articles in major research databases, such as Academic Search Premier 
(EBSCOhost), ERIC, and PsycINFO, no entries are returned. In Electronic Journals 
Service there is just one article. Nevertheless, the objective of text mining and its 
criteria for rigorous research are fully compatible with those of qualitative research, and 
thus it is the hope of the authors that more attention will be paid to text mining by 
qualitative researchers. 


References 

Allport, G. W., & Odbert, H. S. (1936). Trait names: A psycho-lexical study. 
Psychological Monographs, 47, 211. 

Altheide, D., & Johnson, J. M. C. (1998). Criteria for assessing interpretive validity in 
qualitative research. In N. K. Denzin & Y. S. Lincoln (Eds.), Collecting and 
interpreting qualitative materials (pp. 283-312). Thousand Oaks, CA: Sage 
Publications. 

Ananiadou, S. (2006, April). Terminology management for text mining applications. 
Paper presented at National Centre for e-Social Science Workshop, Manchester, 
United Kingdom. 

Arkhurst, C. N. (1994). The thematic apperception test stories of women with and without 
histories of childhood sexual abuse. (Unpublished master’s thesis). City College 
of New York, New York. 



Chong Ho Yu, Angel Jannasch-Pennell, and Samuel DiGangi 


Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater 
reliability in qualitative research: An empirical study. Sociology, 31, 597-606. 

Bales, R. F. (1950). Interaction process analysis. Cambridge, MA: Addison-Wesley. 

Blalock, H. M. (1964). Causal inferences in nonexperimental research. Chapel Hill, NC: 
University of North Carolina Press. 

Camillo, F., Tosi, M., & Traldi, T. (2005). Semiometric approach, qualitative research 
and text mining techniques for modeling the material culture of happiness. Berlin, 
Germany: Springer. 

Carey, J., Morgan, M., & Oxtoby, M. (1996). Inter-coder agreement in analysis of 
responses to open-ended interview questions: Examples from tuberculosis 
research. Cultural Anthropology Methods, 5(3), 1-5. 

Carnap, R. (1937). The logical syntax of language. London: Routledge & Kegan Paul Ltd. 

Chen, N., Kinshuk, Wei, C. W., & Chen, H. (2008). Mining e-Learning domain concept 
map from academic articles. Computers & Education, 50, 1009-1021. 

Chomsky, N. (1957). Syntactic structures. The Hague, The Netherlands: Mouton. 

Cohen, A., & Hersh, W. (2005). A survey of current work in biomedical text mining. 
Briefings in Bioinformatics, 6, 57-71. 

Consoli, D. (2009). Analyzing customer opinions with text mining algorithms. AIP 
Conference Proceedings, 1148, 857-860. 

Dale, R., & Moisl, H. (Eds.). (2000). Handbook of natural language processing. New 
York, NY: Marcel Dekker. 

Feist, J., & Feist, G. (2006). Theories of personality. Boston, MA: McGraw-Hill. 

Feldman, R., & Sanger, J. (2007). The text mining handbook: Advanced approaches in 
analyzing unstructured data. Cambridge: Cambridge University Press. 

Frantzi, K. (2006, April). Author identification. Paper presented at National Centre for e- 
Social Science Workshop, Manchester, United Kingdom. 

Gelbukh, A. (Ed.). (2007). Computational linguistics and intelligent text processing: 8th 
international conference, Mexico City, Mexico, February 18-24, 2007 
proceedings. New York, NY: Springer. 

Gibbs, G. (2006, April). Concordances and semi-automatic coding in qualitative 
analysis: Possibilities and barriers. Paper presented at National Centre for e- 
Social Science Workshop, Manchester, United Kingdom. 

Gillam, L. (2006, April). Sentiment analysis and financial grids. Paper presented at 
National Centre for e-Social Science Workshop, Manchester, United Kingdom. 

Glaser, B. G. (1978). Theoretical sensitivity. Mill Valley, CA: Sociology Press. 

Glaser, B. G. (1992). Basics of grounded theory analysis: Emergence vs. forcing. Mill 
Valley, CA: Sociology Press. 

Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory: Strategies for 
qualitative research. New York, NY: Aldine. 

Golafshani, N. (2003). Understanding reliability and validity in qualitative research. The 
Qualitative Report, 8(4), 597-607. Retrieved July 28, 2010, from 
http://www.nova.edu/ssss/QR/QR8-4/golafshani.pdf 

Guba, E., & Lincoln, Y. (1985). Effective evaluation: Improving the usefulness of 
evaluation. San Francisco, CA: Jossey Bass. 

Guba, E., & Lincoln, Y. (1989). Fourth generation evaluation. Thousand Oaks, CA: 
Sage. 



The Qualitative Report May 2011 


Hall, C. S., & Van de Castle, R. L. (1966). The content analysis of dreams. New York, 
NY: Appleton-Century-Crofts. 

Hall, J., & Stevens, P. (1991). Rigor in feminist research. Advances in Nursing Science, 
13(3), 16-29. 

Hearst, M. (1999, June). Untangling text data mining. Paper presented at the 37th Annual 
Meeting of the Association for Computational Linguistics, College Park, 
Maryland. 

Hong, C. F. (2009). Qualitative chance discovery - Extracting competitive advantages. 
Information Sciences, 179, 1570-1583. 

Huang, C. J., Chen, C. H., Luo, Y. C., Chen, H. X., & Chuang, Y. T. (2008). Developing 
an intelligent diagnosis and assessment e-learning tool for introductory 
programming. Educational Technology & Society, 11(4), 139-157. 

Janasik, N., Honkela, T., & Braun, H. (2009). Text mining in qualitative research: 
Application of an unsupervised learning method. Organizational Research 
Methods, 12, 436-460. 

Jurafsky, D., & Martin, J. (2000). Speech and language processing: An introduction to 
natural language processing, computational linguistics, and speech recognition. 
Upper Saddle River, NJ: Prentice Hall. 

Kano, Y., Baumgartner, W. A., Jr., McCrohon, L., Ananiadou, S., Cohen, K. B., Hunter, 
L., & Tsuji, J. (2009). U-compare: Share and compare text mining tools with 
UIMA. Bioinformatics, 25, 1997-1998. doi:10.1093/bioinformatics/btp289 

Kao, A., & Poteet, S. (Eds.). (2007). Natural language processing and text mining. 
London: Springer. 

Kostoff, R. N., Bedford, C. D., del Rio, J. A., Cortes, H. D., & Karypis, G. (2004). 
Macromolecule mass spectrometry: Citation mining of user documents. Journal 
of the American Society for Mass Spectrometry, 15, 281-287. 

Kostoff, R. N., Block, J. A., Stump, J. A., & Pfeil, K. M. (2004). Information content in 
Medline record fields. International Journal of Medical Informatics, 73, 515-527. 

Kostoff, R. N., & DeMarco, R. A. (2001). Extracting information from the literature by 
text mining. Analytical Chemistry, 73, 371-379. 

Kostoff, R. N., Johnson, D., del Rio, J. A., Bloomfield, L. A., Shlesinger, M. F., Malpohl, 
G., & Cortes, H. (2006). Duplicate publication and “paper inflation” in the fractals 
literature. Science & Engineering Ethics, 12, 543-554. 

Kostoff, R. N., Karpouzian, G., & Malpohl, G. (2005). Text mining the global abrupt- 
wing-stall literature. Journal of Aircraft, 42, 661-664. 

Kostoff, R. N., Morse, S. A., & Oncu, S. (2007). The seminal literature of Anthrax 
research. Critical Reviews in Microbiology, 33, 171-181. 

Koussounadis, A., Redfern, O., & Jones, D. (2009). Improving classification in protein 
structure databases using text mining. BMC Bioinformatics, 10, 1-14. 
doi:10.1186/1471-2105-10-129 

Krippendorff, K. (2004). Content analysis: An introduction to its methodology. Thousand 
Oaks, CA: Sage Publications. 

Lasswell, H. D., Leites, L., Fadner, R., Goldsen, J. M., Grey, A., Janis, I. L., ... Yakobson, 
S. (1949). Language of politics: Studies in quantitative semantics. New York, 
NY: George W. Stewart. 




Lee, C., & Hu, C. (2004). Analyzing hotel customers’ e-complaints from an Internet 
complaint forum. Journal of Travel & Tourism Marketing, 17, 167-181. 

Lewins, A. (2006, April). The CAQDAS networking project. Paper presented at National 
Centre for e-Social Science Workshop, Manchester, United Kingdom. 

Lin, F. R., Hsieh, L. S., & Chuang, F. T. (2009). Discovering genres of online discussion 
threads via text mining. Computers and Education, 52, 481-495. 

Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage. 

Long, T., & Johnson, M. (2000). Rigor, reliability and validity in qualitative research. 
Clinical Effectiveness in Nursing, 4, 30-37. 

Mani, I. (2001). Automatic summarization. Philadelphia, PA: J. Benjamins. 

Manning, C., & Schütze, H. (1999). Foundations of statistical natural language 
processing. Cambridge, MA: MIT Press. 

Mehler, A., & Köhler, R. (2007). Aspects of automatic text analysis. Berlin: Springer. 

Miles, M., & Huberman, M. (1994). Qualitative data analysis: An expanded sourcebook 
(2nd ed.). Thousand Oaks, CA: Sage Publications. 

Miller, T. (2005). Data and text mining: A business applications approach. Upper Saddle 
River, NJ: Pearson. 

Moret, M., Reuzel, R., van der Wilt, G., & Grin, J. (2007). Validity and reliability of 
qualitative data analysis: Inter-observer agreement in reconstructing interpretative 
frames. Field Methods, 19(1), 24-39. 

Morse, J. M. (1997). Perfectly healthy, but dead: The myth of inter-rater reliability. 
Qualitative Health Research, 7, 445-447. 

Morse, J. M. (1999). Myth #93: Reliability and validity are not relevant to qualitative 
inquiry. Qualitative Health Research, 9(6), 717-718. 

Morse, J. M., Barratt, M., Mayan, M., Olson, K., & Spiers, J. (2002). Verification 
strategies for establishing reliability and validity in qualitative research. 
International Journal of Qualitative Methods, 1, 11-23. 

Nasukawa, T. (2006, April). Text analysis within the knowledge mining project and 
sentiment analysis. Paper presented at National Centre for e-Social Science 
Workshop, Manchester, United Kingdom. 

Neuendorf, K. (2002). Content analysis guidebook. Thousand Oaks, CA: Sage 
Publications. 

Niemann, R., Niemann, S., Brazelle, R., van Staden, J., Heyns, M., & de Wet, C. (2000). 
Objectivity, reliability and validity in qualitative research. South African Journal 
of Education, 20(4), 283-286. 

Oxman, T., Rosenberg, S., Schnurr, P., & Tucker, G. (1985). Linguistic dimensions of 
affect and thought in somatization disorder. American Journal of Psychiatry, 142, 
1150-1155. 

Provalis Research. (2006). WordStat. [Computer software]. Montreal, Canada: Author. 

QSR International. (2007). NVivo 8. [Computer software]. Cambridge, MA: Author. 

Researchware, Inc. (2008). HyperRESEARCH 2.8. [Computer software]. Randolph, MA: 
Author. 

Romanowski, M. (2009). What you don’t know can hurt you: Textbook omissions and 
9/11. Clearing House, 82, 290-296. 

Russell, R., & Stiles, W. (1979). Categories for classifying language in psychotherapy. 
Psychological Bulletin, 86, 406-419. 




Schlick, M. (1925/1974). General theory of knowledge. New York, NY: Springer-Verlag. 

Schlick, M. (1959). Positivism and realism. In A. J. Ayer (Ed.), Logical positivism (pp. 
82-107). New York, NY: Free Press. 

Schuldenfrei, R. (1972). Quine in perspective. Journal of Philosophy, 69, 5-16. 

Singh, N., Hu, C., & Roehl, W. (2007). Text mining a decade of progress in hospitality 
human resource management research: Identifying emerging thematic 
development. Hospitality Management, 26, 131-147. 

Singhal, A. (2001). Modern information retrieval: A brief overview. Bulletin of the IEEE 
Computer Society Technical Committee on Data Engineering, 24(4), 35-43. 

Smith, C. P. (2000). Content analysis and narrative analysis. In H. T. Reis & C. M. Judd 
(Eds.), Handbook of research methods in social and personality psychology (pp. 
313-335). Cambridge, UK: Cambridge University Press. 

Spangler, S., Chen, Y., Proctor, L., Lelescu, A., Behal, A., He, B., ... Davis, T. (2009). 
COBRA - Mining web for corporate brand and reputation analysis. Web 
Intelligence & Agent Systems, 7, 243-254. 

SPSS Inc. (2006). SPSS text analysis for surveys 2.0 user’s guide. Chicago, IL: Author. 

Stone, P. J., Dunphy, D. C., Smith, M. S., & Ogilvie, D. M. (1966). The general inquirer: 
A computer approach to content analysis. Cambridge, MA: M.I.T. Press. 

Thompson, C., McCaughan, D., Cullum, N., Sheldon, T., & Raynor, P. (2004). 
Increasing the visibility of coding decisions in team-based qualitative research in 
nursing. International Journal of Nursing Studies, 41(1), 15-20. 

Tsai, H. C. (2009). Chinese immigrant restaurant workers’ injury and illness experiences. 
Archives of Environmental and Occupational Health, 64, 107-114. 

Vellay, S. P., Latimer, N. E., & Paillard, G. (2009). Interactive text mining with pipeline 
pilot: A bibliographic Web-based tool for PubMed. Infectious Disorders-Drug 
Targets, 9, 366-374. 

VERBI Software. (2007). MAXQDA. [Computer software]. Sozialforschung, Germany: 
Author. 

Vilkinas, T. (2008). An exploratory study of the supervision of Ph.D./research students' 
theses. Innovative Higher Education, 32, 297-311. 

Wilson, A. (2006, April). Computer assisted content analysis. Paper presented at 
National Centre for e-Social Science Workshop, Manchester, United Kingdom. 

Winnenburg, R., Wachter, T., Plake, C., Doms, A., & Schroeder, M. (2008). Facts from 
text: Can text mining help to scale-up high-quality manual curation of gene 
products with ontologies? Briefings in Bioinformatics, 9(6), 466-478. 
doi:10.1093/bib/bbn043 

Winter, D. G., & Carlson, L. A. (1988). Using motive scores in the psychobiographical 
study of an individual: The case of Richard Nixon. Journal of Personality, 56(1), 
75-103. 

Yao, L., Evans, J. A., & Rzhetsky, A. (2009). Novel opportunities for computational 
biology and sociology in drug discovery. Trends in Biotechnology, 27, 531-540. 

Yu, C. H. (2006). Philosophical foundations of quantitative research methodology. 
Lanham, MD: University Press of America. 

Zaremba, S., Ramos-Santacruz, M., Hampton, T., Shetty, P., Fedorko, J., Whitmore, J., 
... Pot, D. (2009). Text-mining of PubMed abstracts by natural language 
processing to create a public knowledge base on molecular mechanisms of 
bacterial enteropathogens. BMC Bioinformatics, 10, 177-187. 
doi:10.1186/1471-2105-10-177 


Author Note 

Chong Ho Yu has a Ph.D. in Educational Psychology with an emphasis on 
Measurement, Statistics, and Methodological Studies, and a Ph.D. in Philosophy with a 
concentration on Philosophy of Science. Currently he is Director of Research and 
Assessment at Applied Learning Technologies Institute, Arizona State University (ASU), 
USA. His research activities include philosophical foundations of research 
methodologies and exploratory data analysis. Correspondence regarding this article can 
be addressed to Dr. Chong Ho Yu at 1475 North Scottsdale Road, Scottsdale, Arizona, 
85257-3538; Phone: 480-727-6978; E-mail: chonghoyu@gmail.com 

Dr. Angel Jannasch-Pennell is Assistant Vice President of University Technology 
at ASU, and Director of Research and Outreach initiatives in alt^I. She directs 
collaborative projects across Colleges and Centers, and also integrates community-based 
endeavors and University partnerships. Her research activities include human interface 
of instructional technology, innovative applications of instructional technology across 
different contexts, and large-scale educational assessment. Correspondence regarding 
this article can also be addressed to Dr. Angel Jannasch-Pennell at ASU SkySong, 1475 
North Scottsdale Road, Scottsdale, Arizona 85257-3538; Phone: 480-727-6978; E-mail: 
angel@asu.edu 

Dr. Samuel DiGangi is Associate Vice President of University Technology at 
ASU, Associate Professor of Education, and Executive Director of ASU's Applied 
Learning Technologies Institute (alt^I). His research activities focus on infusing 
effective components of instructional design with emerging technology in education. He 
directs several sponsored research projects examining implementation of 
telecommunications and international networking in the classroom. Correspondence 
regarding this article can also be addressed to Dr. Samuel DiGangi at ASU SkySong, 
1475 North Scottsdale Road, Scottsdale, Arizona 85257-3538; Phone: 480-727-6978; 
E-mail: sam@asu.edu 

Copyright 2011: Chong Ho Yu, Angel Jannasch-Pennell, Samuel DiGangi, and 
Nova Southeastern University 


Article Citation 

Yu, C. H., Jannasch-Pennell, A., & DiGangi, S. (2011). Compatibility between text 
mining and qualitative research in the perspectives of grounded theory, content 
analysis, and reliability. The Qualitative Report, 16(3), 730-744. Retrieved from 
http://www.nova.edu/ssss/QR/QR16-3/yu.pdf 




